Development and Optimization of Parallel Code for Large-Scale Petroleum Reservoir Simulation

Authors

  • S. A. KHASHAN
  • D. O. OGBE
  • T. M. JIANG
Abstract

This paper discusses the use of large field-scale reservoir simulation to model multiphase fluid flow processes that occur in giant oil reservoirs. The goal of large-scale studies is to model fluid flow in sufficient detail to account for reservoir heterogeneity, various reservoir-wellbore configurations, and complex fluid-rock interactions. In this paper, we developed a black-oil reservoir simulator for a distributed-memory parallel environment. We ported the serial code of a black-oil model to the CRAY T3E and IBM SP2 systems, analysed the code, and benchmarked its performance. To parallelize the code, we used a domain decomposition algorithm, whereby the reservoir is divided into several subdomains, with each subdomain assigned to a separate processor element (PE). The Message Passing Interface (MPI) is used to exchange information across subdomains. We validated the parallel simulator using data from the Society of Petroleum Engineers comparative solution projects. Because the linear equation solver accounts for over 90% of the CPU time in a typical reservoir simulation run, we evaluated the performance of several parallel algorithms in this project, including LSOR, Gauss-Seidel (GS), and the strongly implicit procedure (SIP). We found that the convergence behaviour of the solver depends on the number of processors and on the permeability anisotropy of the reservoir. For the problems we tested, the SIP algorithm provided the best performance. We compared the computational efficiency of the parallel code against the serial code using models containing up to 350,000 grid blocks in 4-, 8-, 16-, 32-, 64-, and 80-PE environments. The paper discusses programming and computational performance issues in large-scale reservoir simulation for parallel systems.

… load balancing among the different processors. Accordingly, the global data array of a variable is partitioned among the processor nodes. However, to handle data dependencies whereby data required by one processor reside in the local memory of a neighboring processor, we created extra rows or columns in the partitioned data arrays. Thus, to ensure full concurrency, the partitioned data array on each processor is padded with the extra rows or columns of data needed to localize the computation. With this data partitioning strategy, the processor associated with each subdomain carries out its computations independently. The (padded) overlapped data are then communicated (exchanged) between neighboring processors to enforce the boundary conditions at the dividing interfaces.
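As a minimal sketch of this partitioning strategy, the following Fortran 90/MPI fragment shows how the global j-dimension might be divided among q processors and how the local pressure array can be padded with one ghost row on each side. The grid sizes and the names of the rank and loop variables are hypothetical; this is an illustration of the approach described above, not the authors' actual code.

```fortran
! Sketch: 1-D partition of the j-direction among q PEs, with one layer
! of ghost rows added on each side of the local pressure array.
program partition_sketch
  use mpi
  implicit none
  integer, parameter :: Nx = 100, Ny = 240, Nz = 10   ! hypothetical global grid
  integer :: ierr, myrank, q, Nyl, js, je
  double precision, allocatable :: P(:, :, :)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, myrank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, q, ierr)

  ! Each PE owns Ny/q rows in the j-direction (assumes Ny is divisible by q).
  Nyl = Ny / q
  js  = 1          ! first locally owned j-index
  je  = Nyl        ! last locally owned j-index

  ! Local array padded with ghost rows at j = 0 and j = Nyl + 1.
  allocate(P(Nx, 0:Nyl + 1, Nz))
  P = 0.0d0

  ! Global j-index of a locally owned row (useful for I/O and well handling):
  ! jglobal = myrank * Nyl + jlocal

  call MPI_Finalize(ierr)
end program partition_sketch
```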
Figure 1 shows the domain decomposition used in this work. Given the reservoir geometry, one can decompose its physical domain in one or more dimensions. An efficient domain decomposition algorithm is one that incurs the minimum communication cost. Efficient inter-processor communication involves fewer messages that carry as much contiguous data as possible. In our work, the large amount of overlapped data that must be exchanged along the subdomain interfaces favoured decomposition of the problem in one dimension. In reservoir problems, the number of grid layers along the depth is generally small compared to that of the other dimensions. Therefore, the vertical dimension (k-direction) was not considered for parallel implementation. An efficient decomposition will be along the dominant dimension.

If the dimensions are of comparable size, the well locations should play a deciding role in choosing the dimension to decompose. For example, horizontal and slanted wells impose additional restrictions on the problem. It is recommended that the entire wellbore be contained in a single subdomain, preferably not on the subdomain's boundary. Locating the wellbore across two subdomains means that the computations required to handle variations in the flow field and the wellbore-reservoir interactions are carried out by more than one processor. This may affect the accuracy, or even the convergence, of the solution due to the interim decoupling imposed by the specified domain decomposition strategy. One may have to relax the preference for equal subdomains in favour of ensuring that the whole wellbore is not partitioned across subdomains. For the test problems dealt with in this study, we decomposed the reservoir in the j-direction, parallel to the major axis of the horizontal wells. However, for decomposition in either dimension, care must be taken so that the loop indexing does not conflict with Fortran's column-wise memory allocation sequence. Otherwise, the discontinuity in memory access may degrade code performance and offset any advantage offered by parallel processing. Figure 1 shows the decomposition in the j-direction, and Figure 2 shows the overlapping boundary data exchanged between two neighboring processors.

FIGURE 1: Domain decomposition in j-direction.
FIGURE 2: Domain decomposition in j-direction showing data layout across the subdomain interfaces.

Input-Output Data Requirements and Communication Strategy

We use MPI, the Message Passing Interface standard library, to handle the communications between processors. The large-scale input/output (I/O) data arrays are handled using a preprocessor and a postprocessor. The preprocessor reads all input data (except recurrent data). The global arrays of input data are read, partitioned, and then distributed to the specified number of processors using the decomposition strategy discussed in this study. The recurrent data, however, are read in the main code, one line at a time, by a master processor, which broadcasts each line of data to all other processors. Communication is handled by setting up global and local node numbering (indexing) systems for the nodes in the parallelized dimension. The global index refers to data elements across the global domain, and permits convenient treatment of boundary conditions, well calculations, and post-processing of results for visualization. The local numbering system is local to each processor. This approach simplifies the code parallelization and provides a mechanism for treating the overlapping (shared) nodes at the processor interfaces. If the global pressure array, for example, has the dimension P(Nx, Ny, Nz), the partitioned pressure array on each of the q participating processors is redimensioned as P(Nx, 0:Nyl + 1, Nz), where Nyl = Ny/q. The local index in the parallelized dimension starts at js = 1 and ends at je = Nyl. The padded rows ("ghost data") to the north and south of the subdomain are referenced using the indices js - 1 and je + 1, respectively. The MPI primitive MPI_Sendrecv is used to exchange data between neighboring processors using the local numbering.
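The following sketch illustrates this exchange step in Fortran 90 with MPI. The routine name exchange_ghost_rows and the neighbor/tag variables are hypothetical; the sketch shows the communication pattern described above (filling the ghost rows at js - 1 and je + 1 via MPI_Sendrecv), not the authors' implementation.

```fortran
! Sketch: exchange of overlapping boundary rows between neighboring PEs.
subroutine exchange_ghost_rows(P, Nx, Nyl, Nz, myrank, q)
  use mpi
  implicit none
  integer, intent(in) :: Nx, Nyl, Nz, myrank, q
  double precision, intent(inout) :: P(Nx, 0:Nyl + 1, Nz)
  double precision :: sbuf(Nx, Nz), rbuf(Nx, Nz)
  integer :: north, south, ncount, ierr
  integer :: status(MPI_STATUS_SIZE)

  ncount = Nx * Nz                           ! data in one j-plane
  south  = myrank - 1                        ! neighbor with lower global j
  north  = myrank + 1                        ! neighbor with higher global j
  if (south < 0)     south = MPI_PROC_NULL   ! no neighbor at the reservoir edge
  if (north > q - 1) north = MPI_PROC_NULL

  ! Send owned row je = Nyl to the north; receive the southern neighbor's
  ! last owned row into the ghost row js - 1 = 0.
  sbuf = P(:, Nyl, :)
  call MPI_Sendrecv(sbuf, ncount, MPI_DOUBLE_PRECISION, north, 0, &
                    rbuf, ncount, MPI_DOUBLE_PRECISION, south, 0, &
                    MPI_COMM_WORLD, status, ierr)
  if (south /= MPI_PROC_NULL) P(:, 0, :) = rbuf

  ! Send owned row js = 1 to the south; receive the northern neighbor's
  ! first owned row into the ghost row je + 1 = Nyl + 1.
  sbuf = P(:, 1, :)
  call MPI_Sendrecv(sbuf, ncount, MPI_DOUBLE_PRECISION, south, 1, &
                    rbuf, ncount, MPI_DOUBLE_PRECISION, north, 1, &
                    MPI_COMM_WORLD, status, ierr)
  if (north /= MPI_PROC_NULL) P(:, Nyl + 1, :) = rbuf
end subroutine exchange_ghost_rows
```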
Linear Iterative Solver

For the IMPES formulation, the domain decomposition described above has no significant effect (if any) on the rate of convergence of the outer iteration used to compute the saturations. The formulation essentially involves temporal data dependencies rather than recursive data dependencies. Variables such as saturations and physical properties are updated concurrently after data are exchanged along the subdomain boundaries and transferred into the memory of the current processor. As in all reservoir simulations, the solver consumes most of the CPU time needed for both the calculations and the inter-processor communications of the parallel code. We redesigned the iterative solvers in the original serial code to cope with the distributed-memory paradigm. The linearized system of equations is given in matrix form as

    A x = b    (1)

We use the strongly implicit procedure (SIP), a widely used incomplete-factorization (ILU-type) method. For SIP, the coefficient matrix A in Equation (1) is factorized (preconditioned) into an approximate matrix M, where

    M = L U ≈ A    (2)

L and U are constructed to approximate the sparse matrix pattern, similar to that of the lower and upper parts of A. These factors are not the exact factorization of A, because the extra fills (four fills in 3D problems) are circumvented by interpolation. The correction equation at the kth iteration is

    L U e^k = ρ^k    (3)

where e^k is the correction vector (x^(k+1) - x^k) and ρ^k is the residual vector (b - A x^k). The solution of the system represented by Equation (3) can be obtained by first solving (forward substitution)

    L z = ρ^k    (4)

and then solving (backward substitution)

    U e^k = z    (5)

where z is an intermediate vector. All of these steps, the LU factorization, the forward substitution, and the backward substitution, contain recursive operations that hinder any straightforward parallelization of the code. Parallelizing recursive methods such as SIP in a distributed-memory environment requires distributing the grid normal to the index planes; it demands significantly more complicated computations for the flux integrals and numerous point-to-point (fine-grained) communications. Also, the performance will be limited to that of pipelining, in which excessive CPU time is wasted as many processors sit idle, waiting for data that have yet to be updated on a neighboring processor. In distributed-memory parallelism, memory management requires efficient use of the local memory that can be allocated to a single processor. Therefore, pipelining the computations in ILU-type methods complicates the coding and imposes a high computational overhead, due to the indirect indexing, which limits contiguous memory access. In this study, we endeavor to fully distribute (privatize) the data arrays and to keep them from being replicated on many processors (except those at the overlapped interfaces). The domain decomposition described in the previous section results in the global matrix being split into a system of submatrices. The off-diagonal submatrices A_ij (i ≠ j) represent the coupling (data dependency) between the ith subdomain and its neighbors. The strong coupling between data residing at the interfaces can be relaxed by explicitly using data available from previous iterations. The diagonal submatrices A_ii are factorized into L and U matrices.
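To illustrate the recursion referred to above, the following Fortran sketch shows forward and backward substitution, Equations (4) and (5), for factors with a seven-point stencil structure, which is one common arrangement for SIP-type solvers. The coefficient names (lb, ls, lw, lp, un, ue, ut) and the assumption that U has a unit diagonal and that halo entries of z and e are zero are ours; the sketch is meant only to show why each unknown depends on already-updated neighbors, which prevents simple loop-level parallelization.

```fortran
! Sketch: triangular solves with SIP-type factors on a seven-point stencil.
subroutine sip_substitute(lb, ls, lw, lp, un, ue, ut, rho, z, e, Nx, Ny, Nz)
  implicit none
  integer, intent(in) :: Nx, Ny, Nz
  double precision, dimension(0:Nx+1, 0:Ny+1, 0:Nz+1), intent(in) :: &
       lb, ls, lw, lp, un, ue, ut, rho
  double precision, dimension(0:Nx+1, 0:Ny+1, 0:Nz+1), intent(inout) :: z, e
  integer :: i, j, k

  ! Forward substitution:  L z = rho   (Equation (4)).
  ! z(i,j,k) depends on z(i-1,j,k), z(i,j-1,k), z(i,j,k-1): a recursion.
  do k = 1, Nz
     do j = 1, Ny
        do i = 1, Nx
           z(i,j,k) = (rho(i,j,k) - lb(i,j,k)*z(i,j,k-1) &
                                  - ls(i,j,k)*z(i,j-1,k) &
                                  - lw(i,j,k)*z(i-1,j,k)) / lp(i,j,k)
        end do
     end do
  end do

  ! Backward substitution:  U e = z    (Equation (5)); U assumed unit-diagonal.
  do k = Nz, 1, -1
     do j = Ny, 1, -1
        do i = Nx, 1, -1
           e(i,j,k) = z(i,j,k) - un(i,j,k)*e(i,j+1,k) &
                               - ue(i,j,k)*e(i+1,j,k) &
                               - ut(i,j,k)*e(i,j,k+1)
        end do
     end do
  end do
end subroutine sip_substitute
```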
The iterative scheme on the ith subdomain is then(3)

    L_i U_i (x_i^(ν+1) - x_i^(ν)) = b_i - A_ii x_i^(ν) - Σ_(j≠i) A_ij x_j^(ν)    (6)

The iterations start with x^(0) obtained from the previous outer iteration or from the previous time step. Upon completion of one inner iteration on all processors, the updated variable x^(1) is exchanged across the subdomain boundaries, and this step is repeated until convergence. The dependency between subdomains at the global-domain level resembles that of a Jacobi-type iteration, while it is recursive within each subdomain. It is expected that such an approximation (decoupling) will have a profound effect on the convergence rate. This effect is addressed in the next section.
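To make the structure of this inner iteration concrete, the following sketch shows one way the loop could be organized on each processor: a local SIP sweep on the subdomain block, an exchange of the overlapping rows, and a global convergence test with MPI_Allreduce. The routine names sip_sweep and exchange_ghost_rows and the tolerance tol are hypothetical; this is a schematic of the scheme in Equation (6), not the authors' code.

```fortran
! Sketch: block-Jacobi inner iteration with a local SIP sweep per subdomain.
subroutine inner_iterate(x, b, Nx, Nyl, Nz, myrank, q, tol, maxit)
  use mpi
  implicit none
  integer, intent(in) :: Nx, Nyl, Nz, myrank, q, maxit
  double precision, intent(in) :: tol, b(Nx, Nyl, Nz)
  double precision, intent(inout) :: x(Nx, 0:Nyl + 1, Nz)
  double precision :: rloc, rglob
  integer :: nu, ierr

  do nu = 1, maxit
     ! Local SIP step on the A_ii block: forward and backward substitution,
     ! with ghost values of x from the previous iteration supplying the
     ! off-diagonal terms (hypothetical routine; rloc returns the local sum
     ! of squared residuals).
     call sip_sweep(x, b, Nx, Nyl, Nz, rloc)

     ! Exchange the updated interface rows with the neighboring subdomains
     ! (same MPI_Sendrecv pattern as in the earlier sketch).
     call exchange_ghost_rows(x, Nx, Nyl, Nz, myrank, q)

     ! Global convergence test: sum the squared residuals over all PEs.
     call MPI_Allreduce(rloc, rglob, 1, MPI_DOUBLE_PRECISION, MPI_SUM, &
                        MPI_COMM_WORLD, ierr)
     if (sqrt(rglob) < tol) exit
  end do
end subroutine inner_iterate
```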

Journal: Journal of Canadian Petroleum Technology

Publication year: 2002